Charting AI/ML Growth Across the World

Mahika Jaguste, IIT Gandhinagar, mahika.oj@iitgn.ac.in

Nipun Mahajan, IIT Gandhinagar, mahajan.n@iitgn.ac.in

Shrreya Singh, IIT Gandhinagar, singh.shrreya@iitgn.ac.in

Repo


Visualising The Data Points

A sample of 100 points is taken from the dataset for the visualisation. Each row in our dataset represents a unique paper published in a particular year. The attributes used are [Conference, Year, Title, Author, Affiliation]. To visualise the data in 3D, each paper is identified by the attributes [Conference, Year, Author]. When several papers share the same three attributes, colour is used to distinguish between them. Therefore, each point $p_{id}$ in space, representing a published paper, is given by: $$p_{id} = f(\text{conference}, \text{year}, \text{author})$$
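One possible reading of the mapping $f$ is sketched below. It assumes that conferences and authors are turned into integer codes in order of first appearance and that papers sharing a coordinate get increasing colour indices. The function name `encode_points` and the encoding scheme are illustrative assumptions, not the notebook's actual code.

```python
def encode_points(papers):
    """Map (conference, year, author) triples to 3D coordinates plus a colour index.

    `papers` is a list of (conference, year, author) tuples. Conferences and
    authors receive integer codes in order of first appearance (an assumed
    encoding); the year is used directly as the second coordinate.
    """
    conference_codes = {}
    author_codes = {}
    seen = {}  # how many papers already occupy each coordinate
    points = []
    for conference, year, author in papers:
        x = conference_codes.setdefault(conference, len(conference_codes))
        z = author_codes.setdefault(author, len(author_codes))
        coord = (x, year, z)
        colour = seen.get(coord, 0)  # 0 for the first paper at this point, 1 for the next, ...
        seen[coord] = colour + 1
        points.append((coord, colour))
    return points
```

Two papers by the same author at the same conference in the same year land on the same coordinate and differ only in their colour index.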

The interactive slider above shows the contribution of the top 10 organisations, measured by the number of machine learning research papers each published in a given year.

The notion of 'top organisations' in our analysis is based on the cumulative number of papers published over the years. Since our dataset extends only to 2021, we retrieved the top 10 organisations by the number of papers published during 2006-2021. However, due to methodological constraints, the data for the year 2021 is incomplete.
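The ranking described above can be sketched as follows, assuming the data is available as one `(year, affiliation)` pair per paper. The function names and the input layout are hypothetical and chosen only for illustration.

```python
from collections import Counter


def top_organisations(rows, k=10):
    """Return the k affiliations with the most papers over all years.

    `rows` is a list of (year, affiliation) pairs, one per paper
    (an assumed layout, not necessarily the notebook's).
    """
    totals = Counter(affiliation for _, affiliation in rows)
    return [affiliation for affiliation, _ in totals.most_common(k)]


def yearly_counts(rows, organisations):
    """Count papers per (year, affiliation), restricted to the given organisations."""
    keep = set(organisations)
    return Counter((year, aff) for year, aff in rows if aff in keep)
```

The ranking is computed once on the cumulative totals, and the per-year counts for those organisations are what a year slider would step through.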

We use the number of research papers published as a proxy for the contribution of an organisation to the field. Charting this trend helps to analyse how the current top 10 institutes have behaved over the years. For example, DeepMind, currently a top institute in Deep Learning research, had no publications in the year 2006. However, as the demand for and attention towards Deep Learning increased after 2014, the number of publications soared to 301 in the year 2019. This shift was a response to the growing demand for faster algorithms for deep learning and related techniques after the year 2014.

Interestingly, the year 2020 witnessed the highest cumulative number of research papers across the top 10 affiliated institutes. This trend reflects the current surge in demand for the technology, as many manufacturing sectors rely on machine learning to increase responsiveness.

The 'growth' trend mentioned several times above is evident: the average number of publications increases tremendously, from 25 in the year 2014 to about 290 in the year 2020.


Dataset: 

The dataset contains all paper titles, authors and their affiliations from the following conferences and years:

ICML Conference: 2017-2020
NeurIPS Conference: 2006-2020
ICLR Conference: 2018-2021 (except 2020)

This is a distribution of the top 10 authors who published the maximum number of papers from 2006 up to 2021. Sergey Levine, Pieter Abbeel and Michael Jordan are associated with UC Berkeley, Yoshua Bengio with the University of Montreal and Lawrence Carin with Duke University. According to current trends, America is leading research in AI and ML in terms of the number of papers published. From the above distribution, it can be observed that six out of ten authors are associated with a US institute, which is consistent with this trend.

This is a plot of the total number of papers published against the number of authors who published that many papers. Each colour represents a different number of papers. The output of each author can be considered independent here. As observed, the distribution is not bell-shaped, as one might expect by appeal to the Central Limit Theorem; instead, as the paper count increases, the frequency of authors decreases drastically.
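The quantity plotted here is a count of counts: first the number of papers per author, then the number of authors per paper count. A minimal sketch, assuming the input is a flat list with one author name per (paper, author) pair (a hypothetical layout):

```python
from collections import Counter


def papers_per_author_histogram(author_rows):
    """Return {paper_count: number_of_authors_with_that_count}.

    `author_rows` holds one author name per (paper, author) pair.
    """
    papers_per_author = Counter(author_rows)    # author -> number of papers
    return Counter(papers_per_author.values())  # paper count -> number of authors
```

For a heavy-tailed distribution like the one described, small paper counts have by far the largest frequencies.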

Note: To run the cells below, unzip papers.csv.gz (present in the data folder) and rename the result to papers.csv
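The unzip step can also be done from Python with the standard library. The sketch below is one way to do it; the helper name `decompress_gz` is hypothetical, and the exact source and destination paths are left to the caller.

```python
import gzip
import shutil
from pathlib import Path


def decompress_gz(src, dst):
    """Decompress the gzip file at `src` into a plain file at `dst`."""
    with gzip.open(src, "rb") as compressed, open(dst, "wb") as out:
        shutil.copyfileobj(compressed, out)  # stream in chunks, no full read into memory
    return Path(dst)
```

Assuming the archive sits at `data/papers.csv.gz`, a call such as `decompress_gz("data/papers.csv.gz", "data/papers.csv")` would produce the file the cells expect.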

Dataset: 

This dataset contains the year of publication, title, author details, abstracts, and full text of papers from the following conference and years:

NeurIPS Conference: 1987-2019.

Over 33 years, the research domains in AI have changed significantly. This can be studied by analysing the top keywords in papers published over the years. The observations in this report are based on the top 10 such keywords from 825 papers selected uniformly at random from those published at the NeurIPS conference from 1987-2019. The keywords were extracted using the built-in TF-IDF implementation in the Scikit-learn library. Stopwords were ignored and only relevant keywords were considered during training. The vocabulary was created by learning on 8852 papers. From these word clouds, we propose to study the growth of AI and its tools over the years.
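To make the keyword-extraction idea concrete, here is a standard-library sketch of TF-IDF scoring. It is a stand-in for illustration, not the Scikit-learn pipeline used in the report: the stopword list is a tiny made-up sample, the tokenizer is a simple regular expression, and no vector normalisation is applied.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "of", "and", "in", "for", "to", "is", "on", "with"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def fit_idf(corpus):
    """Learn a smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))  # count each term once per document
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def top_keywords(doc, idf, k=10):
    """Rank the terms of one document by term frequency times IDF."""
    tf = Counter(t for t in tokenize(doc) if t in idf)
    scores = {t: count * idf[t] for t, count in tf.items()}
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

The two-stage shape mirrors the description above: the vocabulary and IDF weights are learned on a large corpus, then the top keywords are read off for each sampled paper.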

Initially, neurons were one of the few widely researched areas in the AI world. Most of the papers focussed on training neural networks or on efficiently developing their neurons. Since 1995, rapid development was witnessed in Learning Classifier System algorithms and their applications. Over the following years, papers focussed on these topics, as can be seen from top trending words like classifier, cell, learning.

In the 21st century, graph, tree and deep neural networks also gained popularity for identifying features and extracting relationships between nodes. As a result, words like graph, matrix, tree, deep were popularly used in papers. Words like regularization were also observed in between. Some words like cluster, learning, kernel remain popular throughout these years, as clustering data and learning on it are still some of the basic data analysis tasks.

Some noise may be present in the above observations, but they give a good idea of how research areas have shifted over the years.


Analyzing research paper titles

One of the basic steps of any data analysis task is the representation of objects in a machine-understandable format. The title of a research paper is the first interaction between the authors and readers. Ideally, the title of the paper captures the precise research field to which the paper contributes. The titles are converted to vectors using the Word2Vec and Doc2Vec algorithms so that computations can be performed on them. The cosine similarity between fifty randomly sampled titles was computed, and the pairs of titles with similarity greater than the threshold indicated by the slider have been plotted. At a similarity threshold of 0.6, a cluster of titles on the topics of unsupervised learning and meta-learning is observed. As the similarity threshold increases, the number of connections between titles decreases, since we have plotted just fifty samples and not all the titles.
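The thresholding step can be sketched independently of how the title vectors were produced. The example below assumes the vectors are plain lists of floats; the function names are illustrative.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_pairs(vectors, threshold):
    """Return index pairs (i, j), i < j, whose cosine similarity exceeds the threshold."""
    return [
        (i, j)
        for (i, u), (j, v) in combinations(enumerate(vectors), 2)
        if cosine_similarity(u, v) > threshold
    ]
```

Each returned pair corresponds to one connection in the plot, so raising the threshold can only shrink the list.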

After observing the similarity between the titles, the Mini-Batch K-Means Clustering algorithm is applied to cluster the titles based on their vector representations and cosine similarities. Although the fields of ML and AI are growing richer by the day, with researchers delving into more niche problems, the number of clusters was restricted to twenty. This was done for ease of understanding the visualisations. After clustering, the most popular terms present in each cluster were extracted to gain an understanding of the label of the cluster. For example, the popular terms in cluster 7 are graph, network, recurrent, training, convolutional, which indicates that the papers grouped together in this cluster are based on training convolutional and recurrent networks. Cluster 12 is captured by the terms faster, parallel, greedy, fast, regularized, implying that the papers in this cluster focus on optimising solutions by using parallel computing and algorithms with better time complexity.
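Once every title has a cluster label, the labelling step can be approximated by counting words per cluster. The sketch below assumes "most popular" means most frequent in the titles of a cluster, which may differ from how the report ranked terms; the function name is hypothetical.

```python
from collections import Counter, defaultdict


def top_terms_per_cluster(titles, labels, k=5):
    """Return {cluster_label: k most frequent lowercase words in that cluster's titles}."""
    term_counts = defaultdict(Counter)
    for title, label in zip(titles, labels):
        term_counts[label].update(title.lower().split())
    return {
        label: [term for term, _ in counts.most_common(k)]
        for label, counts in term_counts.items()
    }
```

In practice stopwords would be removed first, otherwise words such as "for" and "of" tend to dominate every cluster.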

Having analyzed the semantics of the clusters formed, we now visualize them to better understand the trends over the years. Since the title vectors are 100-dimensional and cannot be easily visualized, we perform Principal Component Analysis to project them onto two axes. The title vectors in 2 dimensions, coloured according to the cluster labels, have been plotted. To better understand this plot, the points have been plotted year-wise in the following animation. The colour of each bubble indicates the cluster to which the papers belong and its size is controlled by the number of papers falling in that cluster in that year.

As the slider progresses over the years, we can see a clear increase in the number of papers published in general (except in the year 2021, as the database was created midway through the year, on 20th June). The cluster consisting of papers on attention, understanding, aware, uncertainty, nets (light blue) increases in size in the year 2017. This observation aligns with the fact that the ground-breaking 'Attention Is All You Need' paper was published in the same year. In the clusters representing faster, parallel, greedy, fast, regularized (dark green) and scalable, estimating, application, partial, distributions (light green), we see a sudden rise in the size of the bubble in the year 2018. This is reasonable, as by this phase in AI many algorithms and models had been proposed and the focus shifted to making them optimized and faster. With the rise of cloud technologies, applications involving parallel and distributed computations were proposed.

The clusters of papers on human, scalable, joint, partial, embedding (dark orange) and graph, network, recurrent, training, convolutional have been around since 2006 and show a gradual increase in the number of publications over the years. This seems reasonable since convolutional and recurrent networks had already been proposed by the early 2000s and have been popular as baseline models since then. The terms 'human' and 'scalable' are also at the heart of this field, since researchers try to emulate the neurons of the human brain in computational models and hope to run them on large-scale data to infer meaningful results. The overall trends described above and the increasing bubble sizes are a strong indication of the rapid growth in various niche areas of AI.
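The bubble sizes in the animation come down to counting papers per (year, cluster) pair. A minimal sketch, assuming two parallel lists holding the publication year and cluster label of each paper (hypothetical names and layout):

```python
from collections import Counter


def bubble_sizes(years, labels):
    """Count papers for every (year, cluster label) pair."""
    return Counter(zip(years, labels))


def frames_by_year(years, labels):
    """Group the counts into one {cluster_label: size} mapping per year, sorted by year."""
    frames = {}
    for (year, label), size in sorted(bubble_sizes(years, labels).items()):
        frames.setdefault(year, {})[label] = size
    return frames
```

Each entry of `frames_by_year` corresponds to one slider position, and a cluster that is absent from a year simply has no bubble in that frame.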